Abstract

The analysis of execution paths (also known as software traces) collected from a given software 
product can help in a number of areas including software testing, software maintenance and program 
comprehension. The lack of a scalable matching algorithm operating on detailed execution paths 
motivates the search for an alternative solution. 

This paper proposes the use of word entropies for the classification of software traces. Using a 
well-studied defective software as an example, we investigate the application of both Shannon and 
extended entropies (Landsberg-Vedral, Renyi and Tsallis) to the classification of traces related to 
various software defects. Our study shows that using entropy measures for comparisons gives an 
efficient and scalable method for comparing traces. The three extended entropies, with parameters 
chosen to emphasize rare events, all perform similarly and are superior to the Shannon entropy. 
I. INTRODUCTION



A software execution trace is a log of information captured during a given execution run
of a computer program. For example, the trace depicted in Figure 1 shows the program
flow entering function f1; calling f2 from f1; f2 recursively calling itself; and eventually
exiting these functions. To capture this information, each function in the software
is instrumented to log both its entry and its exit.

1 f1 entry
2 | f2 entry
3 | | f2 entry
4 | | f2 exit
5 | f2 exit
6 f1 exit

FIG. 1. An example of a trace 
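The instrumentation described above can be illustrated with a minimal Python sketch. It is not the tooling used in this paper: the `traced` decorator, the `trace_log` list, and the toy functions `f1` and `f2` are hypothetical names introduced only to reproduce the trace of Figure 1.

```python
import functools

trace_log = []  # collected records: (function name, record type, depth)
_depth = 0      # current depth in the call tree


def traced(func):
    """Hypothetical helper: log entry to and exit from `func`."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        global _depth
        _depth += 1
        trace_log.append((func.__name__, "entry", _depth))
        try:
            return func(*args, **kwargs)
        finally:
            # Log the exit even if the function raises.
            trace_log.append((func.__name__, "exit", _depth))
            _depth -= 1
    return wrapper


@traced
def f2(recurse=True):
    if recurse:
        f2(recurse=False)  # one level of recursion, as in Figure 1


@traced
def f1():
    f2()


f1()
for number, (name, kind, depth) in enumerate(trace_log, start=1):
    print(number, "| " * (depth - 1) + f"{name} {kind}")
```

Each record keeps the function name, the record type (entry or exit), and the call depth, which are the three pieces of information used later when encoding trace records as characters.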



The comparison of program execution traces is important for a number of problem areas
in software development and use. In the area of testing, trace comparisons can be used to: 1)
determine how well user execution paths (traces collected in the field) are covered in testing
[25]; 2) detect anomalous behavior arising during a component's upgrade or reuse [22];
3) map and classify defects [15]; 4) determine redundant test cases executed by one or
more test teams; and 5) prioritize test cases (to maximize execution path coverage with
a minimum number of test cases) [12], [23]. Trace comparisons are also used in operational
profiling (for instance, in mapping the frequency of execution paths used by different user
classes) [25] and intrusion analysis (e.g., detecting deviations of field execution paths from
expectations) [20].

For some problems, such as test case prioritization, traces gathered in a condensed form
(such as a vector of executed function names or caller-callee pairs) are adequate [12]. However,
for others, such as the detection of missing coverage and anomalous behavior using state
machines, detailed execution paths are necessary [5], [22]. The time required for analyzing
traces can be critical. Examples of the use of trace analysis in practice include: 1) a customer
support analyst may use traces to map a reported defect onto an existing set of defects in
order to identify the problem's root cause; 2) a development analyst working with the testing
team may use trace analysis to identify missing test coverage that resulted in a user-observed
defect.

Research Problem and Practical Motivation: To be compared and analyzed, traces must
be converted into an abstract format. Existing work has progressed by representing traces
as signals [18], finite automata [22], and complex networks [3]. Unfortunately, many trace
comparison techniques are not scalable [5], [24]. For example, the finite-state automata based
kTail algorithm, when applied to representative traces, did not terminate even after 24 hours
of execution [5]; similar issues have been experienced with another finite-state automata
algorithm, kBehavior [24]. These observations highlight the need for fast trace matching
solutions.

Based on our experience, support personnel of a large-scale industrial application with 
hundreds of thousands of installations can collect tens of thousands of traces per year. 
Moreover, a single trace collected on a production system is populated at a rate of millions 
of records per minute. Thus, there is a clear need for scalable trace comparison techniques. 

Solution Approach: The need to compare traces, together with a lack of reliable and 
scalable tools for doing this, motivated us to investigate alternate solutions. To speed up 
trace comparisons, we propose that a given set of traces first be filtered, rejecting those that 
are not going to match with the test cases, allowing just the remaining few to be compared 
for target purposes. The underlying assumption (based on our practical experience) is that 
most traces are very different, just a few are even similar, and only a very few are identical. 

This strategy is implemented and validated in the Scalable Iterative-unFolding Technique
(SIFT) [24]. The collected traces are first compressed into several levels prior to comparison.
Each level of compression uses a unique signature or "fingerprint"¹. Starting with
the highest compression level, the traces are compared, and unmatched ones are rejected.
Iterating through the lower levels until the comparison process is complete leaves only traces
that match at the lowest (uncompressed) level. The SIFT objective ends here. The traces
so matched can then be passed on to external tools, such as the ones presented here, for
further analysis such as defect or security breach identification.

The process of creating a fingerprint can be interpreted as a map from the very high
dimensional space of traces to a low (ideally one) dimensional space. Simple examples of
such fingerprints are 1) the total number of unique function names in a trace; and 2) the
number of elements in a trace. However, while these fingerprints may be useful for our
purposes, neither is sufficient. The "number of unique function names" fingerprint does
not discriminate enough: many quite dissimilar traces can share the same set of called
function names. At the other extreme, the number of elements in the trace discriminates too much:
traces which are essentially similar may nonetheless have varying numbers of elements. The
mapping should be such that projections of traces of different types are positioned
far apart in the resulting small space.

1 The fingerprint of the next iteration always contains more information than the fingerprint of the previous
iteration, hence the term unfolding.

Using the frequency of the function names called is the next step in selecting useful
traces. A natural one dimensional representation of this data is the Shannon information
[32], mathematically identical to the entropy of statistical mechanics. Other forms of
entropy/information, obeying slightly less restrictive axiom lists, have been defined. These
extended entropies (as reviewed in [?]) are indexed by a parameter q which, when q → 1,
reduces them to the traditional Shannon entropy and which can be set to make them more
(q < 1) or less (q > 1) sensitive to the frequency of rare events, improving the classification
power of algorithms. Indeed, an extended Renyi entropy [29] with q = 0 (known as
the Hartley entropy of information theory) applied in the context of this paper returns the
"number of unique function names" fingerprint.

The entropy concept can also be extended in another way. Traces differ not only in
which functions they call but in the pattern linking the call of one function with the call of
another. As such it makes sense to collect not only the frequency of function calls, but also
the frequency of calling given pairs, triplets, and in general l-tuples of calls. The frequency
information assembled for these "l-words" can be converted into "word entropies" for further
discriminatory power. In addition, each record in a trace can be encoded in different ways
(denoted by c) by incorporating various information such as a record's function name or
type.

Research Question: In this paper we study the applicability of the Shannon entropy [32]
and the Landsberg-Vedral [19], Renyi [29], and Tsallis [34] entropies to the comparison and
classification of traces related to various software defects. We also study the effect of q, l,
and c values on the classification power of the entropies.

Note that the idea of using word entropies for general classification problems is not new
to this paper. Similar work has been done to apply word entropy classification techniques
to problems arising in biology [8], [13], chemistry, analysis of natural languages [10], and
image processing. In the context of software traces, the Shannon entropy has been used to
measure trace complexity [14]. However, no one has yet applied word entropies to compare
software traces.

The structure of the paper is as follows: in Section II we define entropies and explain the
process of trace entropy calculation. The way in which entropies are used to classify traces
is shown in Section III. Section IV provides a case study which describes and validates the
application of entropies for trace classification, and Section V summarizes the paper.



II. ENTROPIES AND TRACES: DEFINITIONS 



In this section we describe techniques for extracting the probability of various events
from traces (Section II A) and the way we use this information to calculate trace entropy
(Section II B).



A. Extraction of probability of events from traces 

A trace can be represented as a string, in which each trace record is encoded by a unique
character. We concentrate on the following three character types c:

1. Record's function name (F),

2. Record's function name and type (FT),

3. Record's function name, type, and depth in the call tree (FTD).

In addition, we can generate consecutive and overlapping substrings² of length l from
a string. We call such substrings l-words. For example, a string "ABCA" contains the
following 2-words: "AB", "BC", and "CA".

One can consider a trace as a message generated by a source with source dictionary
$A = \{a_1, a_2, \ldots, a_n\}$ consisting of n l-words $a_i$, $i = 1, 2, \ldots, n$, and discrete probability
distribution $P = \{p_1, p_2, \ldots, p_n\}$, where $p_i$ is the probability that $a_i$ is observed. To illustrate
these ideas, the dictionaries A and their respective probability distributions P for various
values of c and l are summarized in Table I for the trace shown in Figure 1.

2 The substring can start at any character i, where i ≤ n − l + 1 and n here denotes the length of the string.

TABLE I. Dictionaries of the trace given in Figure 1

c    l  n  A                                              P
F    1  2  f1, f2                                         1/3, 2/3
F    2  3  f1-f2, f2-f2, f2-f1                            1/5, 3/5, 1/5
F    3  3  f1-f2-f2, f2-f2-f2, f2-f2-f1                   1/4, 1/2, 1/4
FT   1  4  f1-entry, f1-exit, f2-entry, f2-exit           1/6, 1/6, 1/3, 1/3
FTD  1  6  f1-entry-depth1, f1-exit-depth1,               1/6, 1/6,
           f2-entry-depth2, f2-exit-depth2,               1/6, 1/6,
           f2-entry-depth3, f2-exit-depth3                1/6, 1/6

Define a function α that, given a trace t, will return a discrete probability distribution P
for l-words of length l and characters of type c:

$$P \leftarrow \alpha(t; l, c). \qquad (1)$$

The above empirical probability distribution P can now be used to calculate the entropy
of a given trace for a specific l-word with type-c characters. For notational convenience we
suppress the dependence of P (and the individual $p_i$'s) on t, l, and c. We now define
entropies and discuss how we can utilize P to compute them.
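As an illustration of Equation (1), the following Python sketch computes the empirical distribution of l-words for the trace of Figure 1. It is a minimal sketch under stated assumptions: the names `encode` and `alpha`, and the representation of a record as a (name, type, depth) tuple, are hypothetical choices and not part of the original tooling.

```python
from collections import Counter
from fractions import Fraction

# Trace of Figure 1: (function name, record type, depth in the call tree).
TRACE = [
    ("f1", "entry", 1), ("f2", "entry", 2), ("f2", "entry", 3),
    ("f2", "exit", 3), ("f2", "exit", 2), ("f1", "exit", 1),
]


def encode(record, c):
    """Encode one trace record as a 'character' of type c (F, FT, or FTD)."""
    name, kind, depth = record
    if c == "F":
        return name
    if c == "FT":
        return f"{name}-{kind}"
    if c == "FTD":
        return f"{name}-{kind}-depth{depth}"
    raise ValueError(f"unknown character type: {c}")


def alpha(trace, l, c):
    """Return {l-word: probability} for overlapping l-words of the trace."""
    chars = [encode(record, c) for record in trace]
    # Overlapping windows: one l-word starts at every position that fits.
    words = ["-".join(chars[i:i + l]) for i in range(len(chars) - l + 1)]
    counts = Counter(words)
    return {word: Fraction(n, len(words)) for word, n in counts.items()}


print(alpha(TRACE, 2, "F"))
```

Exact fractions are used here only so that the result can be compared directly with Table I; a production implementation would more likely use floating-point frequencies.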



B. Entropies and traces 



The Shannon entropy [32] is defined as

$$H_S(P) = -\sum_{i=1}^{n} p_i \log_b p_i, \qquad (2)$$

where P is the vector containing probabilities of the n states, and $p_i$ is the probability of
the i-th state. The logarithm base b controls the units of entropy. In this paper we set b = 2,
to measure entropy in bits.



Three extended entropies, Landsberg-Vedral [19], Renyi [29], and Tsallis [34], are defined
as:

$$H_L(P; q) = \frac{1 - 1/Q(P; q)}{1 - q}, \quad H_R(P; q) = \frac{\log_b Q(P; q)}{1 - q}, \quad \text{and} \quad H_T(P; q) = \frac{Q(P; q) - 1}{1 - q}, \qquad (3)$$

respectively, where q ≥ 0 is the entropy index, and

$$Q(P; q) = \sum_{i=1}^{n} p_i^q. \qquad (4)$$

These extended entropies reduce to the Shannon entropy (by L'Hôpital's rule) in the limit q → 1.
The extended entropies are more sensitive to states with small probability of occurrence
than the Shannon entropy for 0 < q < 1. Setting q > 1 leads to increased sensitivity of the
extended entropies to states with high probability of occurrence.
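The entropy definitions above translate into a few lines of Python. This is a sketch rather than the implementation used in the study: the function names are hypothetical, the Renyi entropy is assumed to use the same logarithm base b as the Shannon entropy, and the case q = 1 is deliberately left to the Shannon formula because the extended formulas divide by 1 - q.

```python
import math


def q_sum(p, q):
    """Q(P; q): sum of p_i**q over states with non-zero probability."""
    return sum(pi ** q for pi in p if pi > 0)


def shannon(p, b=2):
    """Shannon entropy in units set by the logarithm base b."""
    return -sum(pi * math.log(pi, b) for pi in p if pi > 0)


def landsberg_vedral(p, q):
    return (1 - 1 / q_sum(p, q)) / (1 - q)


def renyi(p, q, b=2):
    # Assumption: same logarithm base as the Shannon entropy.
    return math.log(q_sum(p, q), b) / (1 - q)


def tsallis(p, q):
    return (q_sum(p, q) - 1) / (1 - q)


P = [1 / 3, 2 / 3]  # distribution for c = F, l = 1 in Table I
print(shannon(P), renyi(P, 0), tsallis(P, 1e-5), landsberg_vedral(P, 1e-5))
```

With q = 0 the Renyi entropy of this distribution is log2 of the dictionary size, which is the Hartley entropy mentioned in the introduction. As q approaches 1, the Renyi value approaches the Shannon entropy in base b, while the Tsallis and Landsberg-Vedral values approach the Shannon entropy measured with the natural logarithm.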

The entropy Z of a trace t for a given l, c, and q is calculated by inserting the output of
Equation (1) into one of the entropies described in Equations (2) and (3):

$$Z \leftarrow H_E[\alpha(t; l, c); q], \qquad (5)$$

where $E \in \{L, R, T, S\}$. Note that if E = S then q is ignored, since in fact it must be that
q = 1.



III. USING ENTROPIES FOR CLASSIFICATION OF TRACES

A typical scenario for trace comparison is the following. A software service analyst
receives a phone call from a customer reporting a software failure. The analyst must quickly
determine the root cause of this failure and identify if 1) this is a rediscovery of a known
defect exposed by some other customer in the past or 2) this is a newly discovered defect.
If the first case is correct then the analyst will be able to provide the customer with a
fix-patch or describe a workaround for the problem. If the second hypothesis is correct the
analyst must alert the maintenance team and start a full scale investigation to identify the
root cause of this new problem. In both cases time is of the essence: the faster the root
cause is identified, the faster the customer will receive a fix to the problem.

In order to validate the first hypothesis, the analyst asks the customer to reproduce the
problem with a trace capturing facility enabled. The analyst can then compare the newly
collected trace against a library of existing traces collected in the past (with known root
causes of the problems) and identify potential candidates for rediscovery. To identify a set
of traces relating to similar functionality, the library traces are usually filtered by names of
functions present in the trace of interest. After that the filtered subset of the library traces
is examined manually to identify common patterns with the trace of interest.
If the analyst finds an existing trace with common patterns then the trace corresponds
to a rediscovered defect. Otherwise the analyst can conclude³ that this failure relates to a
newly discovered defect. This process is similar in nature to using an Internet search engine: a
user provides keywords of interest and the engine's algorithm returns a list of web pages
ranked according to their relevance. The user then examines the returned pages to identify the
most relevant ones.

To automate this approach using entropies as fingerprints, we need an algorithm that
would compare a trace against a set of traces, rank this set based on the relevance to a trace
of interest, and then return the top X closest traces for manual examination to the analyst.
In order to implement this algorithm we need the measure of distance between a pair of
traces described in Section III A and the ranking algorithm described in Section III B. The
algorithm's efficiency is analyzed in Section III C.



A. Measure of distance between a pair of traces 

We can obtain multiple entropy-based fingerprints for a trace by varying values of E, q, l
and c. Denote a complete set of 4-tuples of [E, q, l, c] as M. We define the distance between
a pair of traces $t_i$ and $t_j$ as:

$$D(t_i, t_j; M) = \left[ \sum_{k=1}^{m} \left| \frac{H_{E_k}[\alpha(t_i; l_k, c_k); q_k] - H_{E_k}[\alpha(t_j; l_k, c_k); q_k]}{\max\{H_{E_k}[\alpha(t; l_k, c_k); q_k]\}} \right|^{w} \right]^{1/w}, \qquad (6)$$

where m is the number of elements in M, w controls⁴ the type of "norm", $\max\{H_{E_k}[\alpha(t; l_k, c_k); q_k]\}$
denotes the maximum value of $H_{E_k}$ for the complete set of traces under study for a given
$q_k$, $l_k$, and $c_k$; k indexes the 4-tuple $[E_k, q_k, l_k, c_k]$ in M with $E_k$, $q_k$, $l_k$, and $c_k$ indicating the
entropy, q, l-word length and character type for the k-th 4-tuple, k = 1, ..., m.

The denominator is used as a normalization factor to set equal weights to fingerprints
related to different 4-tuples in M.
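A minimal Python sketch of this distance follows, assuming that the distance is the w-norm of the normalized entropy differences. The names `max_per_fingerprint` and `distance`, and the representation of a trace as a list of m precomputed entropy values (one per 4-tuple in M), are hypothetical.

```python
def max_per_fingerprint(fingerprints):
    """Maximum of each entropy fingerprint over the set of traces under study."""
    return [max(column) for column in zip(*fingerprints)]


def distance(fp_i, fp_j, fp_max, w=2):
    """w-norm of the normalized differences between two fingerprint vectors."""
    total = sum(
        abs((h_i - h_j) / h_max) ** w
        for h_i, h_j, h_max in zip(fp_i, fp_j, fp_max)
    )
    return total ** (1 / w)


# Three traces, each described by m = 2 entropy values (made-up numbers).
library = [[1.0, 4.0], [2.0, 8.0], [0.5, 2.0]]
fp_max = max_per_fingerprint(library)  # [2.0, 8.0]
print(distance(library[0], library[1], fp_max))
```

Because each trace is reduced to m scalars before any comparison takes place, the cost of one distance evaluation depends on m only and not on the length of the traces.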



3 This is a simplified description of the analysis process. In practice the analyst will examine defects with
similar symptoms, consult with her peers, search the database with descriptions of existing problems, etc.

4 We do not have a theoretical rationale for selecting an optimal value of w; experimental analysis shows
that our results are robust to the value of w (cf. the appendix for more details).



Equation (6) satisfies three of the four usual conditions of a metric:

$$D(t_i, t_j; M) \geq 0,$$
$$D(t_i, t_j; M) = D(t_j, t_i; M),$$
$$D(t_i, t_j; M) \leq D(t_i, t_k; M) + D(t_k, t_j; M). \qquad (7)$$

However, the fourth metric condition, $D(t_i, t_j; M) = 0 \Leftrightarrow t_i = t_j$ (identity of
indiscernibles), holds true only for the fingerprints of traces; the actual traces may be different
even if their entropies are the same. In other words, the identity of indiscernibles axiom
only "half" holds: $t_i = t_j \Rightarrow D(t_i, t_j; M) = 0$, but $D(t_i, t_j; M) = 0 \nRightarrow t_i = t_j$. As such,
D represents a pseudo-metric. Note that $D(t_i, t_j; M) \in [0, \infty)$ and our hypothesis is the
following: the larger the value of D, the further apart are the traces.
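The pseudo-metric property can be seen in a small Python sketch: two traces that call entirely different functions, but with the same frequency pattern, have identical Shannon entropies and therefore a distance of zero. The helper `shannon_entropy` and the two toy traces are illustrative assumptions.

```python
import math
from collections import Counter


def shannon_entropy(symbols):
    """Shannon entropy (in bits) of the empirical symbol distribution."""
    counts = Counter(symbols)
    n = len(symbols)
    return -sum(c / n * math.log2(c / n) for c in counts.values())


t_i = ["f1", "f2", "f3", "f1"]
t_j = ["f4", "f5", "f6", "f4"]  # different functions, same frequency pattern

d = abs(shannon_entropy(t_i) - shannon_entropy(t_j))
print(shannon_entropy(t_i), shannon_entropy(t_j), d)
```

This is why a dictionary-based pre-filter is a natural companion to entropy fingerprints: the fingerprints compare the shape of the distribution, not the identity of the events.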

Note that for a single pair of entropy-based fingerprints the normalization factor in
Equation (6) can be omitted and we define D as

$$D(t_i, t_j; E, q, l, c) = \left| H_E[\alpha(t_i; l, c); q] - H_E[\alpha(t_j; l, c); q] \right|. \qquad (8)$$

Entropies have the drawback that they cannot differentiate dictionaries of events, since
entropy formulas operate only with probabilities of events. Therefore, the entropies of the
strings "f1-f2-f3-f1" and "f4-f5-f6-f4" will be exactly the same for any value of E, l, c, and
q. The simplest solution is to do a pre-filtering of traces in T in the spirit of the SIFT
framework described in Section I. For example, one can filter out all the traces that do
not contain "characters" (e.g., function names) present in the trace of interest before using
entropy-based fingerprints.

We now define an algorithm for ranking a set of traces with respect to the trace of interest.

B. Traces ranking algorithm

Given the task of identifying the top X classes of traces from a set of traces T that are closest
to trace t, we employ the following algorithm:

1. Calculate the distance between t and each trace in T;

2. Sort the traces in T by their distance to trace t in ascending order;

3. Replace the vector of sorted traces with the vector of classes (e.g., defect IDs) to which
these traces map;

4. Keep the first occurrence (i.e., the closest trace) of each class in the vector and discard
the rest;

5. Calculate the ranking of classes taking into account ties using the "modified competition
ranking"⁵ approach;

6. Return a list of classes with ranking smaller than or equal to X.

The "modified competition ranking" can be interpreted as a worst case scenario approach. 
The ordering of traces of equal ranks is arbitrary; therefore we examine the case when the 
most relevant trace will always reside at the bottom of the returned list. To be conservative, 
we consider the outcome in which our method returns a trace in the top X positions as 
being in the X-th position. 

Now consider an example of the algorithm. 

1. Traces ranking algorithm: example

Assume we have five traces $t_i$, i = 1..5, related to four software defects $d_j$, j = 1..4, as
shown in Table II: trace $t_5$ relates to defect $d_1$; traces $t_1$ and $t_3$ relate to defect $d_2$; trace $t_4$
to defect $d_3$; and trace $t_2$ to defect $d_4$.

TABLE II. Example: Relation between traces and defects

Defect  Trace
d_1     t_5
d_2     t_1, t_3
d_3     t_4
d_4     t_2



5 The "modified competition ranking" assigns the same rank to items deemed equal and leaves the ranking 
gap before the equally ranked items. For example, if A is ranked ahead of B and C (considered equal), 
which in turn are ranked ahead of D then the ranks are: A gets rank 1, B and C get rank 3, and D gets 
rank 4. 
Suppose that we calculate distances between traces using some distance measure. The
distances between (a potentially new) trace t and each trace in $T = \{t_1, \ldots, t_5\}$ and the
defects' ranks obtained using these hypothetical calculations are given in Table III. Trace $t_2$
is the closest to t, hence $d_4$ (to which $t_2$ is related) gets ranking number 1. Traces $t_1$ and $t_4$
have the same distance to t; therefore, $d_2$ and $d_3$ get the same rank. Based on the modified
competition ranking scheme we leave a gap before the set of items with the same
rank and assign rank 3 to both defects. Traces $t_3$ and $t_5$ also have the same distance to t;
however, $t_3$ should be ignored since it relates to the already ranked defect $d_2$. This assigns
rank 4 to $d_1$. The resulting sets of top X defects for different values of X are shown in Table IV.



TABLE III. Example: Traces sorted by distance and ranked

Trace t_i  Distance between t and t_i  Class (defect ID) of trace t_i  Rank
t_2        0                           d_4                             1
t_1        7                           d_2                             3
t_4        7                           d_3                             3
t_3        9                           d_2                             -
t_5        9                           d_1                             4



TABLE IV. Example: Top 1-4 defects

Top X  Set of defects in Top X
Top 1  d_4
Top 2  d_4
Top 3  d_4, d_2, d_3
Top 4  d_4, d_2, d_3, d_1
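The ranking steps and the worked example can be sketched in Python as follows. The function name `rank_classes` and the dictionary-based inputs are hypothetical; the distance assigned to t2 is an assumed value (any value below 7 gives the same ranks), and the quadratic rank computation is written for clarity rather than speed.

```python
def rank_classes(distances, classes, top_x):
    """Return the classes whose modified competition rank is <= top_x.

    distances: {trace: distance to the trace of interest}
    classes:   {trace: class (e.g., defect ID)}
    """
    # Steps 1-2: distances are given; sort traces by distance, ascending.
    ordered = sorted(distances, key=distances.get)
    # Steps 3-4: map traces to classes and keep the closest trace per class.
    best = {}
    for trace in ordered:
        best.setdefault(classes[trace], distances[trace])
    # Step 5: modified competition ranking. Tied classes share the worst
    # rank of the tie, i.e., the number of classes at least as close.
    ranks = {
        cls: sum(1 for other in best.values() if other <= dist)
        for cls, dist in best.items()
    }
    # Step 6: return the classes ranked within the top X.
    return [cls for cls, rank in ranks.items() if rank <= top_x]


distances = {"t1": 7, "t2": 0, "t3": 9, "t4": 7, "t5": 9}  # t2 value assumed
classes = {"t1": "d2", "t2": "d4", "t3": "d2", "t4": "d3", "t5": "d1"}

for x in range(1, 5):
    print(f"Top {x}:", rank_classes(distances, classes, x))
```

Running the loop reproduces the sets of defects listed in Table IV, including the empty step between Top 1 and Top 2 caused by the tie between d2 and d3.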
C. Traces ranking algorithm: efficiency



The number of operations C needed by the ranking algorithm is given by

$$C = \underbrace{c_1 O(|M||T|)}_{\text{Step 1}} + \underbrace{c_2 O(|T| \log |T|)}_{\text{Step 2}} + \underbrace{c_3 O(|T|)}_{\text{Step 3}} + \underbrace{c_4 O(|T|)}_{\text{Step 4}} + \underbrace{c_5 O(|T|)}_{\text{Step 5}} + \underbrace{c_6 O(1)}_{\text{Step 6}}$$
$$\overset{|T| \to \infty,\; c_1 \gg \{c_3, c_4, c_5\}}{\approx} \underbrace{c_1 O(|M||T|)}_{\text{Step 1}} + \underbrace{c_2 O(|T| \log |T|)}_{\text{Step 2}}, \qquad (9)$$

where $c_i$ is a constant number of operations associated with the i-th step, and | · | represents
the number of elements in a given set. In practice, the coefficients $c_3$, $c_4$ and $c_5$ are of much
smaller order than $c_1$ and hence terms corresponding to Steps 3, 4 and 5 do not contribute
significantly to C. The pairwise distance calculation via Equation (6) requires O(|M|)
operations. Therefore, calculating all distances between traces (Step 1) requires O(|M||T|)
operations, so for fixed |M| the number of operations grows linearly with |T|. A typical
sorting algorithm, required by Step 2 (sorting of traces by their distance to trace t), needs
O(|T| log |T|) operations [4]. Usually, $c_1 \gg c_2$; this implies that a user may expect to see a
linear relation between C and |T| (even for large |T|), in spite of the loglinear complexity of
the second term in Equation (9).

The amount of storage needed for entropy-based fingerprints data (used by Equation (6))
is proportional to

$$\underbrace{\phi |M||T|}_{a} + \underbrace{\phi |M|}_{b} = \phi |M| (|T| + 1), \qquad (10)$$

where φ is the number of bytes needed to store a single fingerprint value. Term a is the
amount of storage needed for entropy-based fingerprints for all traces in T, and term b is
the amount of storage needed for the values of $\max\{H_{E_k}[\alpha(t; l_k, c_k); q_k]\}$ from Equation (6).
Assuming that |M| remains constant, the data size grows linearly with |T|. Now we will
compare the complexity of our algorithm with an existing technique.



1. Comparison with existing algorithm 



Note that a general approach for selecting a subset of closest traces to a given trace can
be summarized as follows: (a) calculate the distance between t and each trace in T; and (b)
select a subset of closest traces. The principal difference between various approaches lies in
the technique for calculating the distance between pairs of traces.

Suppose that the distance between a pair of traces is calculated using the Levenshtein
distance⁶ [21]. This distance can be calculated using the difference algorithm [27]. The worst
case complexity for comparing a pair of strings using this algorithm is $O(N D_L)$ [27], where
N is the combined length of a pair of strings and $D_L$ is the Levenshtein distance between these
two strings. Therefore, the complexity of each comparison in step (a) using this technique is $O(N D_L)$.

Let us compare $O(N D_L)$ with the complexity of calculating the distance using the entropy
based algorithm, namely O(|M|). The former depends linearly on the length of the
string representation of a trace, ranging between $10^0$ and $10^8$ "characters" [24]. However,
when strings are completely different, $D_L = N$ and $O(N D_L) \to O(N^2)$, implying quadratic
dependency on the trace length. Conversely, |M| (representing the quantity of scalar entropy
values) should range between $10^0$ and $10^2$. The computations are independent of trace
size and require four mathematical function calls per scalar (see Equation (6)) when
efficiently implemented on modern hardware. Therefore, our algorithm requires several orders
of magnitude fewer operations than the existing algorithms.

In order to implement the Levenshtein distance algorithm we must preserve the original
traces, requiring storage space for $10^0$ to $10^8$ "characters" of each trace [24]. An entropy
based fingerprint, on the other hand, will need to store |M| real numbers (|M| ranging
from $10^0$ to $10^2$). This represents a reduction of several orders of magnitude in storage
requirements.
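For reference, the Levenshtein distance used as the baseline in this comparison can be computed with the classic dynamic-programming recurrence sketched below. Note that this is the textbook O(|a| |b|) algorithm, not the difference algorithm of [27]; it is shown only to make explicit that the cost of the baseline grows with the trace length, and the function name `levenshtein` is an illustrative choice.

```python
def levenshtein(a, b):
    """Minimum number of insertions, deletions, and substitutions
    needed to transform sequence a into sequence b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, item_a in enumerate(a, start=1):
        curr = [i]  # transforming a[:i] into the empty prefix of b
        for j, item_b in enumerate(b, start=1):
            cost = 0 if item_a == item_b else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]


# Traces encoded with c = F: one "character" per record.
print(levenshtein(["f1", "f2", "f2", "f1"], ["f1", "f2", "f3", "f1"]))
```

The function accepts any pair of sequences, so a trace can be passed either as a string of single-character codes or as a list of record labels. Both inputs must be kept in full, in contrast to the handful of scalars needed by the entropy-based distance.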



IV. VALIDATION CASE STUDY 



Our hypothesis is that the predictive classification power will vary with changes in E,
l, c, and q. In order to study the classification power of $H_E[\alpha(t; l, c); q]$ we will analyze
Cartesian products of the following sets of variables:

1. $E \in (S, L, R, T)$,

2. $l \in (1, 2, \ldots, 7)$,

3. $q \in (0, 10^{-5}, 10^{-4}, \ldots, 10^{1}, 10^{2})$,

4. $c \in (F, FT, FTD)$.

6 The Levenshtein distance between two strings is given by the minimum number of operations (insertion,
deletion, and substitution) needed to transform one string into another.

We denote by A the complete set of parameters obtained by the Cartesian product of the 
above sets. 

For our validation case study we use the Siemens test suite, developed by Hutchins et al.
[16]. This suite was further augmented and made available to the public at the
Software-artifact Infrastructure Repository [9], [33]. This software suite has been used in many defect
analysis studies over the last decade (see [17], [26] for literature review) and hence provides
an example on which to test our algorithm.



The Siemens suite [16] contains seven programs. Each program has one original version
and a number of faulty versions. Hutchins et al. [16] created these faulty versions by making
a single manual source code alteration to the original version. A fault can span multiple lines
of source code and multiple functions. Every such fault was created to cause logical rather
than execution errors. Thus, the faulty programs in the suite do not terminate prematurely
after the execution of faulty code; they simply return incorrect output. Each program comes
with a collection of test cases, applicable to all faulty versions and the original program. A
fault can be identified if the output of a test case on the original version differs from the
output of the same test case on a faulty version of the program.

In this study, we experimented with the largest program ("Replace") of the Siemens suite.
It has 517 lines of code, 21 functions, and 31 different faulty versions. There were 5542 test
cases shared across all the versions. Out of these 31 × 5542 test cases, 4266 (≈ 2.5% of the
total number of test cases) caused a program failure when exposed to the faulty program,
i.e., were able to catch a defect. The remaining test cases were probably unrelated to the 31
defects. The traces for failed test cases were collected using a tool called Etrace [13]. The
tool captures sequences of function calls for a particular software execution such as the one
shown in Figure 1. In other words, we collected 4266 function-call level failed traces for 31
faults (faulty versions) of the "Replace" program⁷.

The distribution of the number of traces mapped to a particular defect (version) is given
in Figure 2. Descriptive statistics of trace length are given in Table V. The length ranges
between 11 and 101400 records per trace, with an average length of 623 records per trace.
Average dictionary sizes for various values of c are given in Figure 3. Note that as l gets
larger the dictionary sizes for all c start to converge.

7 The "Replace" program had 32 faults, but the tool "Etrace" was unable to capture the traces of a
segmentation fault in one of the faulty versions of the "Replace" program. This problem was also reported by
other researchers [17].



TABLE V. Descriptive statistics of length of traces

Min.  1st Qu.  Median  Mean   3rd Qu.  Max.
11    218      380     623.3  678      101400



FIG. 2. Distribution of the number of traces per defect (version)

All of the traces contain at least one common function. Therefore, we skip the
pre-filtering step. Note that direct comparison with existing trace comparison techniques is not
possible since 1) the authors of those techniques focus on identification of faulty functions [3] instead of
identification of defect IDs and 2) they analyze the complete set of programs in
the Siemens suite while we focus only on one program (Replace).

The case study is divided as follows: the individual classification power of each $H_E[\alpha(t; l, c); q]$
is analyzed in Section IV A, while Section IV B analyzes the classification power of the
complete set of entropies. Timing analysis of the algorithm is given in Section IV C, and threats
to validity are discussed in Section IV D.

FIG. 3. Dictionary size for various values of l and c.

A. Analysis of individual entropies

Analysis of the classification power of individual entropies is performed using 10-fold 
cross-validation. In this approach the sample set is partitioned into 10 disjoint subsets of 
equal size. A single subset is used as validation data for testing and the remaining nine 
subsets are used as training data. The process is repeated tenfold (ten times) using each 
data subset as the validation data just once. The results of the validation from each fold are 
averaged to produce a single estimate. This 10-fold repetition guards against the sampling
bias which can be introduced through the use of the more traditional 1-fold validation in 
which 70% of the data is used for training and the remaining 30% is used for testing. The 
validation process is designed as follows: 

1. Randomly partition 4266 traces into 10 bins

2. For each set of parameters E, l, c, q

(a) For each bin

i. Tag traces in a given bin as a validating set of data and traces in the remaining
nine bins as a training set

ii. For each trace t in the validating set calculate the rank of t's class (defect ID)
in the training set using the algorithm in Section III B⁸ with Equation (8) as
the measure of distance and with the set of parameters E, l, c, q

(b) Compute summary statistics about ranks of the "true" classes and store this data
for further analysis
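The validation loop for a single parameter set can be sketched in Python as below. This is a hedged illustration rather than the scripts used in the study: each trace is assumed to be already reduced to one entropy value paired with its class label, the distance is the absolute difference of entropies, ties are resolved with the modified competition ranking, and all names (`rank_of_true_class`, `cross_validate`, the synthetic `samples`) are hypothetical.

```python
import random
import statistics


def rank_of_true_class(value, label, training):
    """Modified competition rank of `label` among the training classes."""
    best = {}  # closest distance per class
    for train_value, cls in training:
        d = abs(value - train_value)
        if cls not in best or d < best[cls]:
            best[cls] = d
    if label not in best:
        return None  # class absent from the training set
    return sum(1 for d in best.values() if d <= best[label])


def cross_validate(samples, top_x=5, folds=10, seed=0):
    """Mean and standard deviation of the per-fold Top-X hit fraction."""
    shuffled = samples[:]
    random.Random(seed).shuffle(shuffled)
    bins = [shuffled[k::folds] for k in range(folds)]
    fractions = []
    for k, validating in enumerate(bins):
        training = [s for j, b in enumerate(bins) if j != k for s in b]
        hits = 0
        for value, label in validating:
            rank = rank_of_true_class(value, label, training)
            if rank is not None and rank <= top_x:
                hits += 1
        fractions.append(hits / len(validating))
    return statistics.mean(fractions), statistics.stdev(fractions)


# Synthetic, well-separated data: three classes, twenty traces each.
samples = [(base + 0.01 * i, cls)
           for cls, base in (("a", 0.0), ("b", 10.0), ("c", 20.0))
           for i in range(20)]
print(cross_validate(samples, top_x=1))
```

On real data the outer loop over E, l, c, and q would wrap `cross_validate`, with the entropy values recomputed for every parameter set.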

Our findings show that the best results are obtained for $H_E$ with $E \in \{L, R, T\}$, $l = 3$,
$q \in \{10^{-5}, 10^{-4}\}$, and $c = FTD$. Based on 10-fold cross validation, the entropies with
these parameters were able to correctly classify approximately 21.6% ± 1.1%⁹ of the Top 1
defects and approximately 57.6% ± 1.5% of the Top 5 defects (see Table VI and Figure 4). Based
on the standard deviation data in Table VI, all six entropies show robust results. However, the
results become slightly more volatile for high ranks (see Figure 5). We now analyze these
findings in detail.
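
The ± values quoted above follow the formula given in footnote 9. A minimal sketch of that calculation, assuming ten per-fold fractions are available; the function name `ci95_half_width` and the sample values are hypothetical, and the t-quantile is hard-coded from standard tables rather than computed:

```python
from math import sqrt
from statistics import mean, stdev

# q(0.975, 9): 0.975 quantile of Student's t-distribution with 9 degrees
# of freedom, taken from standard tables (rounded to three decimals).
T_QUANTILE_0975_DF9 = 2.262


def ci95_half_width(fold_fractions):
    """Half-width of the 95% CI of the mean over 10 folds (footnote 9)."""
    if len(fold_fractions) != 10:
        raise ValueError("the hard-coded quantile assumes exactly 10 folds")
    # stdev() is the sample standard deviation (n - 1 in the denominator).
    return T_QUANTILE_0975_DF9 / sqrt(10) * stdev(fold_fractions)


# Hypothetical per-fold Top 1 fractions, for illustration only.
folds = [0.20, 0.22, 0.20, 0.22, 0.20, 0.22, 0.20, 0.22, 0.20, 0.22]
print(f"{mean(folds):.3f} +/- {ci95_half_width(folds):.4f}")
```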




FIG. 4. Interpolated average fractions of correctly classified traces in the Top 5 (based on 10-fold
cross validation) for $E = L$, $l = 3$, $q = 10^{-5}$, and $c = FTD$ for different values of $l$ and $q$.

8 Technically, in order to identify the true ranking one must tweak Step 6 of the algorithm and return a
vector of 2-tuples [class, rank].

9 95% confidence interval, calculated as $\pm q(0.975, 9)/\sqrt{10} \times$ standard deviation, where
$q(x, df)$ represents the quantile function of the t-distribution, $x$ is the probability, and $df$
is the degrees of freedom.

FIG. 5. Fraction of correctly classified traces in the Top 5 for $E = L$ and $c = FTD$. The solid
line shows the average fraction of correctly classified traces in 10 folds; the dotted line shows the
pointwise 95% confidence interval (95% CI) of the average.

The 3-words ($l = 3$) provide the best results based on the fraction of correctly classified
traces in the Top 5 (see Figure 6), suggesting that chains of three events provide an optimal
balance between the amount of information in a given $l$-word and the total number of words.
As $l$ gets larger, the amount of data becomes insufficient to obtain a good estimate of the
probabilities.
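
The trade-off between word length and the amount of data can be seen on the small trace of Figure 1. The sketch below is illustrative only and rests on assumptions: $l$-words are taken to be sliding windows of $l$ consecutive trace characters, depth is counted from 1, and the helper names (`project`, `l_words`, `word_probabilities`) are hypothetical.

```python
from collections import Counter

# Figure 1 trace as (function name, probe type, depth) characters.
# Depth numbering starting at 1 is an assumption made for illustration.
TRACE_FTD = [
    ("f1", "entry", 1), ("f2", "entry", 2), ("f2", "entry", 3),
    ("f2", "exit", 3), ("f2", "exit", 2), ("f1", "exit", 1),
]


def project(trace, c):
    """Keep only the fields named by c: 'F', 'FT', or 'FTD'."""
    keep = {"F": (0,), "FT": (0, 1), "FTD": (0, 1, 2)}[c]
    return [tuple(ch[i] for i in keep) for ch in trace]


def l_words(characters, l):
    """All windows of l consecutive characters (sliding-window assumption)."""
    return [tuple(characters[i:i + l]) for i in range(len(characters) - l + 1)]


def word_probabilities(characters, l):
    """Relative frequency of each distinct l-word in the trace."""
    counts = Counter(l_words(characters, l))
    total = sum(counts.values())
    return {word: n / total for word, n in counts.items()}


for l in (1, 2, 3):
    chars = project(TRACE_FTD, "F")
    probs = word_probabilities(chars, l)
    print(f"l={l}: {len(l_words(chars, l))} words, {len(probs)} dictionary entries")
```

As $l$ grows, the number of available windows shrinks while the number of possible distinct words grows, so each probability is estimated from fewer observations.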

Examining the classification performance of $c$ (see Table VII) we see essentially no
differences across the three levels of $c$ (F, FT, and FTD) considered here (with $E = L$, $l = 3$,
and $q \in \{10^{-4}, 10^{-5}\}$). This is true across all values of X in the percentage of correctly
classified traces in the Top X. This is somewhat surprising since $c = FTD$ contains more
information than $c = FT$ which, in turn, contains more information than $c = F$. This suggests
that for this data set and parameters ($E = L$, $l = 3$, and $q \in \{10^{-4}, 10^{-5}\}$) the function
names contain the relevant classification information, while the probe type (entry or exit) and
depth provide no additional relevant information. Note that even though more time is needed
to calculate the FTD-based entropies (since the dictionary of FTDs will contain twice as
many entries as the dictionary of FTs) the comparison time remains the same (since the
probabilities of $l$-words, $P$, map to a scalar value via the entropy function for all values of
$c$).

FIG. 6. The average fraction of traces correctly classified in the Top 5 for various values of $l$;
$E = L$, $q \in \{0, 10^{-5}, 10^{0}, 10^{2}\}$, $c = FTD$.

Our findings show that the extended entropies outperform the Shannon entropy¹⁰ for
$q < 1$ and $q > 1$ (see Figure 7). However, the performance of extended entropies with $q < 1$ is
significantly better than with $q > 1$, suggesting that rare events are more important than
frequent events for classification of defects in this dataset. The best results are obtained for
$q = 10^{-4}$ and $q = 10^{-5}$.

It is interesting to note that classification performance is almost identical for $H_E$ with
$E \in \{L, R, T\}$, $l = 3$, $q \in \{10^{-5}, 10^{-4}\}$, and $c = FTD$. We believe that this fact can be
explained as follows: for entropies with $q \to 0$, the ordering of similar traces (with similar
dictionaries) is determined mainly by a function of the probabilities of the traces' events.
This function is independent of $E$ and $q$ and depends only on $l$ and $c$; see Appendix B for
details.



10 We do not explicitly mention entropy values on the figures. However, extended entropy values with
$q = 1$ correspond to values of the Shannon entropy.

FIG. 7. Average fraction of correctly classified traces in the Top 5 for various values of $q$; $E = L$,
$l \in \{1, 3, 7\}$, $c = FTD$.

B. Analysis of the complete set of entropies 

Analysis of the classification power for the complete set of entropies is performed using
10-fold cross-validation in a similar manner to the process described in Section IV A.
However, instead of calculating distances for each $H_E$ independently, we now calculate
distances between traces by utilizing values of $H_E$ for all parameter sets in A
simultaneously¹¹. The validation process is designed as follows:

1. Randomly partition 4266 traces into 10 bins

   (a) For each bin

       i. Tag traces in a given bin as a validating set of data and traces in the remaining
          nine bins as a training set

       ii. For each trace $t$ in the validating set calculate the rank of $t$'s class (defect
           ID) in the training set using the algorithm in Section III B with Equation (9)
           and all 4-tuples of parameters in A. (Based on the experiments described
           in Appendix A we set $w = 1$.)

   (b) Compute summary statistics about ranks of the "true" classes and store this data
       for further analysis

11 We had to exclude a subset of entropies with $E = L$, $q = 10^{2}$ for all $l$ and $c$ from A. The values of
entropies obtained with these parameters are very large ($> 10^{100}$), which leads to numerical instability of
Equation (9). We keep just one of the various named $q = 1$ entropies to avoid redundancy.

TABLE VI. Fraction of correctly classified traces in Top X for 1) $H_E[\alpha(t; l, c); q]$ with
$E \in \{L, R, T\}$, $q \in \{10^{-5}, 10^{-4}\}$, $l = 3$, and $c = FTD$, and 2) set of entropies A; based on
10-fold cross validation. The average fraction of correctly classified traces in 10 folds is denoted
by "Avg."; plus-minus 95% confidence interval is denoted by "95% CI".

TABLE VII. Percent of correctly classified traces in Top X for $H_E[\alpha(t; l, c); q]$, $E = L$, $l = 3$,
and $q \in \{10^{-4}, 10^{-5}\}$.

Top X    c = F    c = FT   c = FTD
Top 1    21.7%    20.9%    21.6%
Top 2    37.2%    35.3%    36.2%
Top 3    49.5%    46.7%    47.6%
Top 4    54.0%    53.5%    54.8%
Top 5    56.2%    56.6%    57.6%

The results, in the two right-most columns of Table VI, show the increase of predictive
power: in the case of the Top 1 the results improved from 21.6% (for individual entropies)
to 29.8% (for all entropies combined); for the Top 5 from 57.6% to 62.1%. However, the
503-fold increase in computational effort (the number of entropy fingerprints rises from 1
to 504) yielded only an 8% relative increase (4.5 percentage points) in the power to predict
Top 5 matches. We leave the resulting balance between cost and benefit for each individual
analyst to judge.



C. Timing 

We compared the theoretical efficiency of our algorithm with existing algorithms in
Section III C. These findings may be validated against experimental timing data¹². We
measured the time needed to compare a reference trace against a complete set of traces using
difference-based and entropy-based algorithms. The experiment is repeated three times for
three reference traces: small (498 characters), medium (2339 characters), and large (20767
characters)¹³. The results are given in Table VIII. As expected, the comparison time of the
difference-based algorithm grows with the length of the reference trace, while
the comparison time of the entropy-based algorithms remains constant across trace sizes.
Moreover, the entropy-based algorithms are several orders of magnitude
faster than the difference-based approach. Note that, even though the computational
effort for measuring distance based on the complete set of fingerprints increases by two
orders of magnitude as compared to individual entropies, the results are obtained in less
than a second. Therefore, from a practical perspective, we can still use a complete set of
entropies for our analysis.

12 Computations were performed on a computer with Intel Core 2 Duo E6320 CPU.

13 The reference traces were chosen arbitrarily.



TABLE VIII. Timing results (in seconds) for comparison of a single reference trace against a set
of traces using difference and entropy algorithms.

                                               Reference trace
Algorithm                    Small               Medium               Large
                             (498 characters)    (2339 characters)    (20767 characters)
Difference                   2.7E1               4.2E1                1.9E2
Individual entropy           1.3E-4              1.3E-4               1.3E-4
Complete set of entropies    6.8E-2              6.8E-2               6.8E-2


D. Threats to Validity 

A number of tests are used to determine the quality of case studies. In this section
we discuss four core tests: construct, internal, statistical, and external validity [36]. The
discussion highlights potential threats to the validity of our case study and the tactics that
we used to mitigate them.

Construct validity: to overcome potential construct validity issues we use two measures
of classification performance (Top 1 and Top 5 correctly classified traces). Internal validity:
to prevent data gathering issues all data collection was automated and a complete corpus
of test cases was analyzed. Statistical validity: to prevent sampling bias, 10-fold cross
validation was used. External validity: this case study shows that the method can be
successfully applied to a particular, well-studied, data set. Following the paradigm of the
"representative" or "typical" case advanced in [36], this suggests that the method may also
be useful in more situations.

V. SUMMARY



In this work we analyze the applicability of a technique which uses entropies to perform
predictive classification of traces related to software defects. Our validating case
study shows promising performance of extended entropies with emphasis on rare events
($q \in \{10^{-5}, 10^{-4}\}$). The events are based on triplets (3-words) of "characters" incorporating
information about function name, depth of function call, and type of probe point ($c = FTD$).

In the future, we are planning to increase the number of datasets under study, derive 
additional measures of distance (e.g., using tree classification algorithms) and identify an 
optimal set of parameter combinations. 

Appendix A: Selection of w for Equation (9)

We do not have a theoretical rationale for selection of the optimal value of $w$ in
Equation (9). When $|M| = 1$, Equation (9) simplifies to Equation (8), becoming independent
of $w$. However, in the general case of $|M| > 1$, $w$ will affect performance of the distance
metrics. In order to select an optimal value of $w$ we performed the analysis of the complete
set of entropies discussed in Section IV B for $w = 1, 2, \ldots, 5$. Table IX gives the percentage
of correctly classified traces in the Top 1 and the Top 5. The quality of classification does
not change significantly with $w$. It degrades slightly with increased $w$, but the small
magnitude of this change makes us unwilling to consider it a result of the paper.
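
The role of $w$ can be illustrated with a short sketch. The multi-entropy distance equation itself is not reproduced in this appendix, so the code below assumes a Minkowski-type form, $\left(\sum_k |H_k(t_i) - H_k(t_j)|^w\right)^{1/w}$, chosen only because it reduces to the absolute difference of a single entropy when $|M| = 1$, as stated above. The function name `combined_distance` and the numeric values are hypothetical.

```python
def combined_distance(h_i, h_j, w=1):
    """Distance between two entropy fingerprints (one value per parameter set).

    Assumed form: (sum_k |h_i[k] - h_j[k]|**w) ** (1 / w). This is an
    illustrative guess at the shape of the multi-entropy distance, not a
    transcription of the paper's equation.
    """
    if len(h_i) != len(h_j):
        raise ValueError("fingerprints must have the same length")
    return sum(abs(a - b) ** w for a, b in zip(h_i, h_j)) ** (1.0 / w)


# With a single entropy (|M| = 1) the result does not depend on w.
for w in range(1, 6):
    print(w, combined_distance([2.0], [0.5], w))

# With several entropies the value of w changes the result.
print(combined_distance([3.0, 0.0], [0.0, 4.0], w=1))
print(combined_distance([3.0, 0.0], [0.0, 4.0], w=2))
```

Under this assumed form, larger $w$ gives more weight to the parameter set with the largest entropy difference, which is one way the choice of $w$ could influence the ranking when $|M| > 1$.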



TABLE IX. Percent of correctly classified traces in Top X for the complete set of entropies A and
w = 1, 2, ..., 5.

w        1        2        3        4        5
Top 1    29.8%    29.7%    29.6%    29.3%    29.2%
Top 5    62.1%    61.5%    61.5%    61.3%    61.4%



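Appendix B: Approximation of Equation (8)

We have observed that the classification power of the $H_E[\alpha(t; l, c); q]$ metric is highest
when $q \to 0$. In order to explain this phenomenon we expand $H_E[\alpha(t; l, c); q]$ (defined
earlier in the paper) in a Taylor series:

\[
\begin{aligned}
H_L[\alpha(t_i; l, c); q] &\overset{q \to 0}{=} 1 - \frac{1}{n_i}
    + q\left(\frac{A_i}{n_i^2} + \frac{n_i - 1}{n_i}\right) + O(q^2), \\
H_R[\alpha(t_i; l, c); q] &\overset{q \to 0}{=} \log_2(n_i)
    + q\left[\frac{A_i}{\ln(2)\, n_i} + \log_2(n_i)\right] + O(q^2), \text{ and} \\
H_T[\alpha(t_i; l, c); q] &\overset{q \to 0}{=} n_i - 1 + q\,(A_i + n_i - 1) + O(q^2),
\end{aligned}
\tag{B1}
\]

where $A_i = \sum_{k=1}^{n_i} \ln(p_k)$. By plugging (B1) into Equation (8) and assuming that for
similar traces $n \approx n_i \approx n_j$, we get:

\[
\begin{aligned}
D(t_i, t_j; L, q, l, c) &\approx \frac{q}{n^2}\,|A_i - A_j|, \\
D(t_i, t_j; R, q, l, c) &\approx \frac{q}{\ln(2)\, n}\,|A_i - A_j|, \text{ and} \\
D(t_i, t_j; T, q, l, c) &\approx q\,|A_i - A_j|.
\end{aligned}
\tag{B2}
\]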

Equation (B2) can be interpreted as follows. In the case in which $q \to 0$ and the
dictionaries for each trace in the pair are similar, the key contribution to the measure of
distance comes from the $\sum_{k=1}^{n} \ln(p_k)$ term (which depends only on $l$ and $c$), making the
rest of the variables irrelevant ($q$ and $n$ become parts of scaling factors). This can be
highlighted by solving a system of equations to identify conditions that generate the same
ordering for three traces $t_i, t_j, t_k$ for all extended entropies (using the approximations
from (B2)):

\[
\left.
\begin{aligned}
\frac{q}{n^2}\,|A_i - A_j| &\lessgtr \frac{q}{n^2}\,|A_i - A_k| \\
\frac{q}{\ln(2)\, n}\,|A_i - A_j| &\lessgtr \frac{q}{\ln(2)\, n}\,|A_i - A_k| \\
q\,|A_i - A_j| &\lessgtr q\,|A_i - A_k|
\end{aligned}
\right\}
\;\Rightarrow\; |A_i - A_j| \lessgtr |A_i - A_k|.
\tag{B3}
\]
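
The expansions in (B1) and the common ordering in (B3) can be checked numerically. The sketch below is not part of the original analysis; it assumes the standard textbook forms of the Landsberg-Vedral, Renyi, and Tsallis entropies built on $Q = \sum_k p_k^q$, takes $n$ to be the number of dictionary entries, and uses three made-up probability vectors of equal size.

```python
from math import log, log2


def entropy(kind, probs, q):
    """Extended entropies in their standard form (assumed), with Q = sum p_k**q."""
    Q = sum(p ** q for p in probs)
    if kind == "L":                      # Landsberg-Vedral
        return (1 - 1 / Q) / (1 - q)
    if kind == "R":                      # Renyi (base-2 logarithm)
        return log2(Q) / (1 - q)
    if kind == "T":                      # Tsallis
        return (Q - 1) / (1 - q)
    raise ValueError(kind)


def first_order(kind, probs, q):
    """First-order Taylor approximation around q = 0, as in (B1)."""
    n = len(probs)
    A = sum(log(p) for p in probs)
    if kind == "L":
        return 1 - 1 / n + q * (A / n ** 2 + (n - 1) / n)
    if kind == "R":
        return log2(n) + q * (A / (log(2) * n) + log2(n))
    if kind == "T":
        return n - 1 + q * (A + n - 1)
    raise ValueError(kind)


# Hypothetical l-word probabilities for three traces with equal dictionary size.
t_i = [0.25, 0.25, 0.25, 0.25]
t_j = [0.4, 0.3, 0.2, 0.1]
t_k = [0.7, 0.1, 0.1, 0.1]
q = 1e-5

for kind in "LRT":
    d_ij = abs(entropy(kind, t_i, q) - entropy(kind, t_j, q))
    d_ik = abs(entropy(kind, t_i, q) - entropy(kind, t_k, q))
    print(kind, d_ij < d_ik)
```

For these inputs all three entropies place $t_j$ closer to $t_i$ than $t_k$, in line with the observation that only $|A_i - A_j|$ versus $|A_i - A_k|$ matters once $q$ and $n$ act as common scaling factors.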



In information theory, $-\ln(p_k)$ measures the "surprise" in receiving symbol $k$, which
is received with probability $p_k$ (in nats; in bits when the logarithm is taken to base 2).
Thus $-\sum_{k=1}^{n} p_k \ln(p_k)$ is the expected surprise or information (Shannon entropy). What
about just $\sum_{k=1}^{n} \ln(p_k)$? Its magnitude scales with the total number of bits needed to
specify each symbol. This is related to the problem of simulating processes in the
presence of rare events.